Abstract

In this paper we consider a version of the zero-shot learning problem where seen class source and target domain data are provided. The goal during test-time is to accurately predict the class label of an unseen target domain instance based on revealed source domain side information (e.g. attributes) for unseen classes. Our method is based on viewing each source or target data as a mixture of seen class proportions, and we postulate that the mixture patterns have to be similar if the two instances belong to the same unseen class. This perspective leads us to learning source/target embedding functions that map arbitrary source/target domain data into the same semantic space, where similarity can be readily measured. We develop a max-margin framework to learn these similarity functions and jointly optimize parameters by means of cross validation. Our test results are compelling, leading to significant improvement in terms of accuracy on most benchmark datasets for zero-shot recognition.

1. Introduction 

While there has been significant progress in large-scale classification in recent years [31], lack of sufficient training data for every class and the increasing difficulty in finding annotations for a large fraction of data might impact further improvements.

Zero-shot learning is being increasingly recognized as a way to deal with these difficulties. One version of zero-shot learning is based on so-called source and target domains. The source domain is described by a single vector corresponding to each class based on side information such as attributes [8, 16, 21, 25, 29], language words/phrases [4, 9, 34], or even learned classifiers [42], which we assume can be collected easily. The target domain is described by a joint distribution of images/videos and labels [16, 41]. During training time, we are given source domain attributes and target domain data corresponding to only a subset of classes, which we call seen classes. During test time, source domain attributes for unseen (i.e. no training data provided)



Figure 1. Proposed method, with source/target domain data displayed on the leftmost/rightmost figures respectively. Light blue corresponds to unseen classes and other colors depict seen class data. Light-blue data is unavailable during training. During test-time, unseen source domain data is revealed, an arbitrary unseen instance from the target domain (light blue) is presented, and we are to identify its unseen class label. Each unseen class source domain vector is expressed as a histogram of seen class proportions. Seen class proportions are estimated for the target instance and compared with each of the source domain histograms.

classes are revealed. The goal during test time is to predict 
for each target domain instance which of the seen/unseen 
classes it is associated with. 

Key Idea: Our proposed method is depicted in Fig. 1. We view target data instances as arising from seen instances and attempt to express source/target data as a mixture of seen class proportions. Our algorithm is based on the postulate that if the mixture proportion from the target domain is similar to that from the source domain, they must arise from the same class. This leads us to learning source and target domain embedding functions using seen class data that map arbitrary source and target domain data into mixture proportions of seen classes.

We propose parameterized optimization problems for learning semantic similarity embedding (SSE) functions from training data and jointly optimize predefined parameters using cross validation on held-out seen class data. Our method necessitates fundamentally new design choices, requiring us to learn class-dependent feature transforms, because components of our embedding must account for the contribution of each seen class. Our source domain embedding is based on the subspace clustering literature [37], whose methods are known to be resilient to noise. Our target domain embedding is based on a margin-based framework using the intersection function or the rectified linear unit (ReLU) [22], which attempts to align seen class source domain data with their corresponding seen class target domain data instances. Finally, we employ a cross validation technique based on holding out seen class data and matching held-out seen classes to optimize the parameters used in the optimization problems for the source and target domains. In this way we jointly optimize parameters to best align mixture proportions for held-out seen classes and provide a basis for generalizing to unseen classes. Results on several benchmark datasets for zero-shot learning demonstrate that our method significantly improves the current state-of-the-art results.

Related Work: Most existing zero-shot learning methods rely on predicting side information for further classification. [24] proposed a semantic (i.e. attribute) output code classifier which utilizes a knowledge base of semantic properties. [16, 39] proposed several probabilistic attribute prediction methods. [42] proposed designing discriminative category-level attributes. [18] proposed an optimization formulation to learn source domain attribute classifiers and attribute vectors jointly. [20] proposed learning the classifiers for unseen classes by linearly combining the classifiers for seen classes. [1] proposed a label embedding method to embed each class into an attribute vector space. [2, 9, 23, 34] directly learned the mapping functions between the feature vectors in source and target domains with deep learning. Such methods may suffer from noisy (e.g. missing or incorrectly annotated) side information or data bias, leading to unreliable prediction.

Some recent work has been proposed to overcome some of the issues above. [28] proposed a propagated semantic transfer method that exploits unlabeled instances. [10] discussed the projection domain shift problem and proposed a transductive multi-view embedding method. [14] investigated the attribute unreliability issue and proposed a random forest approach. [30] proposed a simple method by introducing a better regularizer.

An important conceptual difference that distinguishes our method from other existing works such as [1, 2] is that these methods can be interpreted as learning relationships between source attributes and target feature components (in the encoded space), while our method is based on leveraging similar class relationships (semantic affinities) in source and target domains, requiring a class dependent feature transform. This leads to complex scoring functions, which cannot be simplified to linear or bilinear forms as in [1, 2].

Semantic similarity embedding (SSE) is widely used to model the relationships among classes, which is quite insensitive to instance level noise. [40] proposed learning mapping functions to embed input vectors and classes into a low dimensional common space based on class taxonomies. [3] proposed a label embedding tree method for large multi-class tasks, which also embeds class labels in a low dimensional space. [12] proposed an analogy-preserving semantic


| Notation | Definition |
|---|---|
| $\mathcal{S}$ ($\mathcal{U}$) | Set of seen (unseen) classes |
| $|\mathcal{S}|$ | Number of seen classes |
| $s$ (or $y$) and $u$ | Indexes for seen and unseen classes |
| $\Delta^{|\mathcal{S}|}$ | Simplex in $|\mathcal{S}|$ dimensional space |
| $\{c_y\}$ | Source domain attribute vector $c_y$ for class $y$ with $\ell_2$ normalization, i.e. $\|c_y\| = 1$ |
| $\{(x_i, y_i)\}$ | Training data: $x_i$ - target feature, $y_i$ - class |
| $N$ ($N_y$) | Number of training samples (for class $y \in \mathcal{S}$) |
| $\psi$, $\pi$ | Source/Target domain feature embedding functions |
| $\phi_y$ | Target domain class dependent feature transformation |
| $(\cdot)_n$ | The $n$th entry in a vector |
| $z_y = \psi(c_y)$ | Learned source domain embedded histogram $z_y \in \Delta^{|\mathcal{S}|}$ for class $y$ |
| $v_y$ | Learned target domain reference vector for class $y$, one vector per seen class |
| $w$ | Learned target domain weight vector |
| $f(x, y)$ | Learned structured scoring function relating the target domain sample $x$ and class label $y$ |

Table 1. Some notation used in our method.


embedding method for multi-class classification. Later [13] proposed a unified semantic embedding method to incorporate different semantic information into learning. Recently [23] proposed a semantic embedding method for zero-shot learning to embed an unseen class as a convex combination of seen classes with heuristic weights. [11] proposed a semantic ranking representation based on semantic similarity to aggregate semantic information from multiple heterogeneous sources. Our embedding represents each class as a mixture of seen classes in both domains.

2. Zero-Shot Learning and Prediction 

Our notation is summarized in Table 1 for future reference. 

2.1. Overview 

Our method is based on expressing source/target data as a mixture of seen class proportions (see Fig. 1). Using seen class data we learn source and target domain embedding functions, $\psi$ and $\pi$ respectively. Our aim is to construct functions that take an arbitrary source vector $c$ and target vector $x$ as inputs and embed them into $\Delta^{|\mathcal{S}|}$ (histograms). Observe that the components $\pi_y(x)$, $\psi_y(c)$ of $\pi(x)$, $\psi(c)$ corresponding to seen class $y \in \mathcal{S}$ denote the proportion of class $y$ in the instance $x$, $c$. During test-time, source domain vectors $c_u$ for all the unseen classes are revealed. We are then presented with an arbitrary target instance $x$. We predict an unseen label for $x$ by maximizing the semantic similarity between the histograms. Letting $z_u = \psi(c_u)$, our zero-shot recognition rule is defined as follows:

$$u^* = \arg\max_{u \in \mathcal{U}} f(x, u) = \arg\max_{u \in \mathcal{U}} \langle \pi(x), z_u \rangle, \tag{1}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors.
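
The recognition rule in Eq. 1 can be illustrated with a minimal NumPy sketch. The function name `predict_unseen` and the toy histograms are assumptions made for illustration only; in the method itself, $\pi(x)$ and $z_u$ come from the learned embedding functions.

```python
import numpy as np

def predict_unseen(pi_x, Z_unseen):
    """Return the index of the unseen class maximizing <pi(x), z_u>.

    pi_x:     (|S|,) target-domain embedding of one instance.
    Z_unseen: (|U|, |S|) matrix whose rows are the histograms z_u.
    """
    scores = Z_unseen @ pi_x          # one inner product per unseen class
    return int(np.argmax(scores)), scores

# Toy example with three seen classes and two unseen classes (made-up values).
pi_x = np.array([0.7, 0.2, 0.1])
Z_unseen = np.array([[0.1, 0.1, 0.8],
                     [0.6, 0.3, 0.1]])
u_star, scores = predict_unseen(pi_x, Z_unseen)
```

Here the instance is dominated by the first seen class, so the second unseen class, whose histogram puts most mass on that seen class, receives the larger score.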

We propose parameterized optimization problems to learn embedding functions from seen class data. We then optimize these parameters globally using held-out seen class data. We summarize our learning scheme below.

(A) Source Domain Embedding Function ($\psi$): Our embedding function is realized by means of a parameterized optimization problem, which is related to sparse coding.

(B) Target Domain Embedding Function ($\pi$): We model $\pi_y(x)$ as $\langle w, \phi_y(x) \rangle$. This consists of a constant weight vector $w$ and a class dependent feature transformation $\phi_y(x)$. We propose a margin-based optimization problem to jointly learn both the weight vector and the feature transformation. Note that our parameterization may yield negative values and may not be normalized; both properties could be enforced through additional constraints, but we ignore this issue in our optimization objectives.

(C) Cross Validation: Our embedding functions are parameter dependent. We choose these parameters by employing a cross validation technique based on holding out seen class data. First, we learn embedding functions (see (A) and (B)) on the remaining (not held-out) seen class data with different values of the predefined parameters. We then jointly optimize parameters of the source/target embedding functions to minimize the prediction error on held-out seen classes. In the end we re-train the embedding functions over the entire seen class data.

Salient Aspects of Proposed Method:

(a) Decomposition: Our method seeks to decompose source and target domain instances into mixture proportions of seen classes. In contrast, much of the existing work can be interpreted as learning cross-domain similarity between source domain attributes and target feature components.

(b) Class Dependent Feature Transformation $\pi_y(x)$: The decomposition perspective necessitates fundamentally new design choices. For instance, $\pi_y(x)$, the component corresponding to class $y$, must be dependent on $y$, which implies that we must choose a class dependent feature transform $\phi_y(x)$ because $w$ is a constant vector and agnostic to class.

(c) Joint Optimization and Generalization to Unseen Classes: Our method jointly optimizes parameters of the embedding functions to best align source and target domain histograms for held-out seen classes, thus providing a basis for generalizing to unseen classes. Even for fixed parameters, the embedding functions $\psi$, $\pi$ are nonlinear maps, and since the parameters are jointly optimized, our learned scoring function $f(x, y)$ couples seen source and target domain data together in a rather complex way. So we cannot reduce $f(\cdot, \cdot)$ to a linear or bilinear setting as in [2].


2.2. Intuitive Justification of Proposed Method

Recall that our method is based on viewing unseen source and target instances as histograms of seen class proportions. Fig. 1 suggests that a target instance can be viewed as arising from a mixture of seen classes, with mixture components dependent on the location of the instance. More precisely, letting $P$ and $P_y$ be the unseen and seen class-conditional target feature distributions respectively, we can a priori approximate $P$ as a mixture of the $P_y$'s, i.e. $P = \sum_{y \in \mathcal{S}} \pi_y P_y + P_{\text{error}}$ (see [5] for various approaches in this context), where $\pi_y$ denotes the mixture weight for class $y$. Analogously, we can also decompose source domain data as a mixture of source domain seen classes. This leads us to associate a mixture proportion vector $z_u$ with unseen class $u$, and represent the attribute vector $c_u$ as $c_u \approx \sum_{y \in \mathcal{S}} z_{u,y} c_y$, with $z_u = (z_{u,y})_{y \in \mathcal{S}} \in \Delta^{|\mathcal{S}|}$.

Key Postulate: The target domain instance, $x$, must have on average a similar mixture pattern as the source domain pattern if they both correspond to the same unseen label, $u \in \mathcal{U}$, namely, on average $\pi(x)$ is equal to $z_u$.

This postulate is essentially Eq. 1. This postulate also motivates our margin-based approach for learning $w$. Note that since we only have a single source domain vector for each class, a natural constraint is to require that the empirical mean of the mixture corresponding to each example per class in the target domain aligns well with the source domain mixture. This is empirically consistent with our postulate. Letting $y, y'$ be seen class labels with $y \neq y'$ and $\bar{\pi}_y$ denote the average mixture for class $y$ in the target domain, our requirement is to guarantee that

$$\langle \bar{\pi}_y, z_y \rangle \ge \langle \bar{\pi}_y, z_{y'} \rangle \;\Longleftrightarrow\; \sum_{s \in \mathcal{S}} \underbrace{\left( \frac{1}{N_y} \sum_{i=1}^{N} \mathbb{1}_{\{y_i = y\}}\, \pi_s(x_i) \right)}_{\text{Emp. Mean Embedding}} \left( z_{y,s} - z_{y',s} \right) \ge 0, \tag{2}$$

where $\mathbb{1}_{\{\cdot\}}$ denotes a binary indicator function returning 1 if the condition holds, otherwise 0. Note that the empirical mean embedding corresponds to a kernel empirical mean embedding [33] if $\pi_s$ is a valid (characteristic) RKHS kernel, but we do not pursue this point further in this paper. Nevertheless, this alignment constraint is generally insufficient, because it does not capture the shape of the underlying sample distribution. We therefore add SVM-style misclassification constraints for each seen sample to account for shape.

2.3. Source Domain Embedding

Recall from Fig. 1 and (A) in Sec. 2.1 that our embedding aims to map source domain attribute vectors $c$ to histograms of seen class proportions, i.e. $\psi(c) \in \Delta^{|\mathcal{S}|}$. We propose a parameterized optimization problem inspired by sparse coding as follows, given a source domain vector $c$:

$$\psi(c) = \arg\min_{\alpha \in \Delta^{|\mathcal{S}|}} \left\{ \frac{\gamma}{2} \|\alpha\|^2 + \frac{1}{2} \Big\| c - \sum_{y \in \mathcal{S}} c_y \alpha_y \Big\|^2 \right\}, \tag{3}$$



where $\gamma \ge 0$ is a predefined regularization parameter, $\| \cdot \|$ denotes the $\ell_2$ norm of a vector, and $\alpha = (\alpha_y)_{y \in \mathcal{S}}$ describes contributions of different seen classes. Note that
(a) Seen classes in training (b) Unseen classes in testing

Figure 2. Cosine similarity matrices among (a) seen and (b) unseen classes on the aPascal & aYahoo [8] dataset. Brighter color depicts larger values. The type of data used to compute the matrix is shown above the corresponding matrix. Observe that in training/testing our source/target domain embedding preserves the inter-class relationships originally defined by the source domain attribute vectors. This also indicates that our target domain embeddings manage to align the target domain distributions well with the source domain attribute vectors.


even though $c$ may not be on the simplex, the embeddings $\psi(c)$ always are. Note that the embedding $\psi$ is in general a nonlinear function. Indeed, on account of the simplex constraint, small values in the $\alpha$ vector are zeroed out (i.e. "water-filling").

To solve Eq. 3, we use quadratic programming. For large-scale cases, we adopt efficient proximal gradient descent methods. Note that there are many alternate ways of embedding, such as similarity rescaling, subspace clustering [27], sparse learning [7], and low rank representation [17], as long as the embedding is on the simplex. We tried these different methods with the simplex constraint to learn the embeddings, and our current solution in Eq. 3 works best. We believe this is probably because the goal in these other methods is subspace clustering, while our goal is to find a noise resilient embedding which generalizes well to unseen class classification.
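
As a rough illustration of the proximal gradient option mentioned above, the following NumPy sketch minimizes a ridge-regularized least-squares objective of the form $\frac{\gamma}{2}\|\alpha\|^2 + \frac{1}{2}\|c - \sum_y c_y \alpha_y\|^2$ over the simplex by projected gradient descent. The function names, the fixed iteration count, and the sort-based simplex projection are our own assumptions, not the authors' implementation.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    n = v.size
    u = np.sort(v)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, n + 1)
    rho = k[u - css / k > 0][-1]              # size of the support
    tau = css[rho - 1] / rho                  # uniform shift ("water level")
    return np.maximum(v - tau, 0.0)

def source_embedding(c, C, gamma, n_iter=500):
    """Hypothetical solver: C is (d, |S|) with seen-class vectors as columns."""
    G = C.T @ C
    b = C.T @ c
    # Step size 1/L, where L bounds the curvature of the smooth objective.
    step = 1.0 / max(gamma + np.linalg.eigvalsh(G)[-1], 1e-12)
    alpha = np.full(C.shape[1], 1.0 / C.shape[1])
    for _ in range(n_iter):
        grad = gamma * alpha + G @ alpha - b
        alpha = project_simplex(alpha - step * grad)
    return alpha

# Toy check with three orthonormal seen-class vectors (an assumption).
C = np.eye(3)
z_small = source_embedding(C[:, 0], C, gamma=0.0)   # coordinate vector
z_large = source_embedding(C[:, 0], C, gamma=1.0)   # mass spread over classes
```

With orthonormal class vectors, `gamma=0.0` returns the coordinate vector of the class itself, while a larger `gamma` spreads mass onto the other seen classes, in line with the role of the regularizer described in the text.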

We optimize the parameter, $\gamma$, globally by cross validation. Once the $\gamma$ parameter is identified, all of the seen classes are used in our embedding function. Note that when $\gamma = 0$ or small, $\psi(c_y)$ will be a coordinate vector, which essentially amounts to coding for multi-class classification but is not useful for unseen class generalization. Conceptually, because we learn tuning parameters to predict well on held-out seen classes, $\gamma$ is in general not close to zero. We demonstrate class affinity matrices before and after embedding for both seen and unseen classes in Fig. 2. Here $\gamma = 10$ is obtained by cross validation. We see that in both training and testing, source domain embeddings preserve the affinities among classes in the attribute space.
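
A class affinity matrix of the kind shown in Fig. 2 can be computed as pairwise cosine similarities between class vectors. The sketch below is a generic illustration; the helper name `cosine_affinity` and the toy attribute vectors are assumptions.

```python
import numpy as np

def cosine_affinity(M):
    """Pairwise cosine similarities between the rows of M (one row per class)."""
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    U = M / np.maximum(norms, 1e-12)      # guard against all-zero rows
    return U @ U.T

# Toy attribute vectors for three classes (made-up values).
attrs = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 2.0]])
A = cosine_affinity(attrs)
```

Applying the same function to attribute vectors and to their embeddings gives two matrices that can be compared entry by entry to check whether inter-class relationships are preserved.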

During test-time, when unseen class attribute vectors $c_u$ are revealed, we obtain $z_u$ as the embeddings using Eq. 3 with the learned $\gamma$.


2.4. Target Domain Embedding 

In this paper we define our target domain class dependent mapping function $\phi_y$ based on (1) the intersection function (INT) [19], or (2) the rectified linear unit (ReLU) [22]. That is,

$$\text{INT:} \quad \phi_y(x) = \min(x, v_y), \tag{4}$$

$$\text{ReLU:} \quad \phi_y(x) = \max(0, x - v_y), \tag{5}$$

where min and max are the entry-wise operators. Note that the intersection function captures the data patterns in $x$ below the thresholds in each $v_y$, while ReLU captures the data patterns above the thresholds. In this sense, the features generated from these two functions are complementary. This is the reason that we choose the two functions to demonstrate the robustness of our method.
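
One way to make the complementarity of the two transforms concrete is the entry-wise identity below, which follows directly from the definitions of the intersection function and ReLU. The case split is our own illustration and is not part of the original derivation.

$$\min(x_m, v_{y,m}) + \max(0, x_m - v_{y,m}) =
\begin{cases}
x_m + 0 = x_m, & x_m \le v_{y,m}, \\
v_{y,m} + (x_m - v_{y,m}) = x_m, & x_m > v_{y,m}.
\end{cases}$$

In other words, for every entry $m$ the two features split $x_m$ into the part below the threshold $v_{y,m}$ and the part above it.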

Based on Eq. 1 and 2, we define the structured scoring function $f(x, y)$ as follows:

$$f(x, y) = \sum_{s \in \mathcal{S}} \langle w, \phi_s(x) \rangle \, z_{y,s}. \tag{6}$$

At test-time, for a target instance $x$, we can compute $f(x, u)$ for an arbitrary unseen label $u$ because the source attribute vector is revealed for $u$. Note that $f$ is highly non-convex, and it cannot be reduced to the bilinear functions used in existing works such as [1, 2].
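
The scoring function can be sketched in a few lines of NumPy by combining the INT or ReLU transform with the weight vector and a class histogram. All names (`phi_int`, `phi_relu`, `target_embedding`, `score`) and the toy values are assumptions for illustration; the learned quantities $w$, $v_s$ and $z_y$ would come from training.

```python
import numpy as np

def phi_int(x, V):
    """Intersection transform: entry-wise min with each reference vector."""
    return np.minimum(x, V)

def phi_relu(x, V):
    """ReLU transform: entry-wise positive part of x minus each reference vector."""
    return np.maximum(0.0, x - V)

def target_embedding(x, V, w, phi):
    """pi(x): one entry <w, phi_s(x)> per seen class; V has shape (|S|, d)."""
    return phi(x[None, :], V) @ w

def score(x, z_y, V, w, phi):
    """f(x, y) = sum_s <w, phi_s(x)> * z_{y,s}."""
    return float(target_embedding(x, V, w, phi) @ z_y)

# Toy values (made up): two seen classes, two feature dimensions.
x = np.array([1.0, 3.0])
V = np.array([[2.0, 2.0],
              [0.0, 4.0]])
w = np.array([1.0, 2.0])
z_y = np.array([0.25, 0.75])
f_int = score(x, z_y, V, w, phi_int)
f_relu = score(x, z_y, V, w, phi_relu)
```

Because `V` holds one reference vector per seen class, each entry of the embedding uses a different transform of the same input, which is what makes the score class dependent.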

2.4.1 Max-Margin Formulation 

Based on Eq. 6, we propose the following parameterized learning formulation for zero-shot learning, which learns the embedding function $\pi$, and thus $f$:

$$\min_{\mathcal{V}, w, \xi, \epsilon} \; \frac{1}{2} \|w\|^2 + \frac{\lambda_1}{2} \sum_{v \in \mathcal{V}} \|v\|^2 + \lambda_2 \sum_{y, s} \epsilon_{ys} + \lambda_3 \sum_{i, y} \xi_{iy} \tag{7}$$

$$\text{s.t.} \quad \forall i \in \{1, \cdots, N\}, \; \forall y \in \mathcal{S}, \; \forall s \in \mathcal{S},$$

$$\sum_{i=1}^{N} \frac{\mathbb{1}_{\{y_i = y\}}}{N_y} \left[ f(x_i, y) - f(x_i, s) \right] \ge \Delta(y, s) - \epsilon_{ys}, \tag{8}$$

$$f(x_i, y_i) - f(x_i, y) \ge \Delta(y_i, y) - \xi_{iy}, \tag{9}$$

$$\epsilon_{ys} \ge 0, \quad \xi_{iy} \ge 0, \quad \forall v \in \mathcal{V}, \; v \ge \mathbf{0},$$

where $\Delta(\cdot, \cdot)$ denotes a structural loss between the ground-truth class and the predicted class, $\lambda_1 \ge 0$, $\lambda_2 \ge 0$, and $\lambda_3 \ge 0$ are the predefined regularization parameters, $\xi = \{\xi_{iy}\}$ and $\epsilon = \{\epsilon_{ys}\}$ are slack variables, and $\mathbf{0}$ is a vector of 0's. In this paper, we define $\Delta(y_i, y) = 1 - c_{y_i}^T c_y$ and $\Delta(y, s) = 1 - c_y^T c_s$, respectively. Note that in learning we only access and utilize the data from seen classes.

In fact, Eq. 8 measures the alignment loss for each seen class distribution, and Eq. 9 measures the classification loss for each target domain training instance, respectively, which






















(a) Distr. alignment (b) Inst. classification (c) Our method

Figure 3. Illustration of three different constraints for learning the target domain semantic embedding function. Different shapes denote different classes, filled-in shapes denote the source domain embeddings, and green crosses denote the empirical means of target domain data embeddings. Our method addresses zero-shot learning using both distribution alignment and instance classification.

correspond to the discussion in Sec. 2.2. On one hand, if we only care about the alignment condition, it is likely that there may be many misclassified training data samples (i.e. loose shape), as illustrated in Fig. 3(a). On the other hand, conventional classification methods only consider separating data instances with tight shape, but are unable to align distributions due to the lack of such a constraint in training (see Fig. 3(b)). By introducing these two constraints into Eq. 7, we are able to learn the target domain embedding function as well as the scoring function to produce clusters which are well aligned and separated, as illustrated in Fig. 3(c).
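
To make the two kinds of constraints concrete, the sketch below computes the smallest slack values that satisfy an alignment constraint (class-mean margins) and an instance classification constraint (per-sample margins) for a given matrix of scores. The function name `slacks`, the score matrix `F` with `F[i, s]` standing for $f(x_i, s)$, and the toy numbers are assumptions; the structural loss is taken as one minus the inner product of unit-norm class attribute vectors, as defined in the text.

```python
import numpy as np

def slacks(F, y, C):
    """Smallest feasible slacks for given scores.

    F: (N, |S|) scores, F[i, s] = f(x_i, s).
    y: (N,) integer seen-class labels.
    C: (|S|, d) unit-norm class attribute vectors (rows).
    Returns (eps, xi) with shapes (|S|, |S|) and (N, |S|).
    """
    delta = 1.0 - C @ C.T                         # structural loss Delta(y, s)
    n_cls = C.shape[0]
    own = F[np.arange(len(y)), y][:, None]        # f(x_i, y_i)
    xi = np.maximum(0.0, delta[y] - (own - F))    # instance classification
    eps = np.zeros((n_cls, n_cls))
    for c in range(n_cls):
        mask = y == c
        if mask.any():
            # Mean over class-c samples of f(x_i, c) - f(x_i, s).
            mean_margin = (F[mask, c][:, None] - F[mask]).mean(axis=0)
            eps[c] = np.maximum(0.0, delta[c] - mean_margin)
    return eps, xi

# Toy example: two orthogonal classes, three samples (made-up scores).
C = np.eye(2)
F = np.array([[2.0, 0.5],
              [0.2, 0.4],
              [1.0, 0.9]])
y = np.array([0, 1, 0])
eps, xi = slacks(F, y, C)
```

In this toy case the third sample violates its own margin (a large entry in `xi`) even though the class-0 mean margin is nearly satisfied (a small entry in `eps`), which mirrors the difference between the two constraints.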

Similarly, we learn the predefined parameters $\lambda_1, \lambda_2, \lambda_3$ through a cross validation step that optimizes the prediction for held-out seen classes. Once the parameters are determined, we re-learn the classifier on all of the seen data. Fig. 2 depicts class affinity matrices before and after target domain semantic embedding on real data. Our method manages to align source/target domain data distributions.

2.4.2 Alternating Optimization Scheme 

To solve Eq. 7, we propose the following alternating opti¬ 
mization algorithm, as seen in Alg. 1. 


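Algorithm 1 Learning Embedding Functions

Input: $\{x_i, y_i\}$, $\{c_y\}_{y \in \mathcal{S}}$, $\{z_y\}_{y \in \mathcal{S}}$, $\lambda_1, \lambda_2, \lambda_3$, learning rate $\eta > 0$

Initialize $\nu^{(0)}$ with feature means of seen classes in target domain;

for $t = 0$ to $\tau$ do

    $(w, \epsilon, \xi) \leftarrow$ linearSVM_solver($\{x_i, y_i\}$, $\nu^{(t)}$, $\lambda_2$, $\lambda_3$);

    $\nu^{(t+1)} \leftarrow \nu^{(t)} - \eta \nabla h(\nu^{(t)})$;

    Check monotonic decreasing condition on the objective in Eq. 7;

end

Output: $w$, $\nu$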


(i) Learning $w$ by fixing $\mathcal{V}$: In this step, we can collect all the constraints in Eq. 8 and Eq. 9 by plugging in $\{(x_i, y_i)\}$, $\mathcal{V}$, $\{c_y\}_{y \in \mathcal{S}}$, and then solve a linear SVM to learn $w$, $\epsilon$, $\xi$, respectively.

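(ii) Learning $\mathcal{V}$ by fixing $w$ using the Concave-Convex procedure (CCCP) [43]: Note that the constraints in Eq. 8 and Eq. 9 consist of difference-of-convex (DoC) functions. To see this, we can rewrite $f(x_i, y) - f(x_i, y_i)$ as a summation of convex and concave functions as follows:

$$f(x_i, y) - f(x_i, y_i) = \sum_{m, s} w_m \left( z_{y,s} - z_{y_i,s} \right) \phi_{s,m}(x_i), \tag{10}$$

where $w_m$ and $\phi_{s,m}(\cdot)$ denote the $m$th entries in vectors $w$ and $\phi_s(\cdot)$, respectively. Let $\nu$ be the vector concatenation of all $v$'s, and let $g_1(\nu) = g_1(x_i, y, \nu)$ and $g_2(\nu) = g_2(x_i, y, \nu)$ denote the summations of all the convex and all the concave terms in Eq. 10, respectively. Then we have $f(x_i, y) - f(x_i, y_i) = g_1(\nu) - (-g_2(\nu))$, i.e. DoC functions. Using CCCP we can relax the constraint in Eq. 9 as

$$\xi_{iy} \ge \Delta(y_i, y) + g_1(\nu) + g_2(\nu^{(t)}) + \nabla g_2(\nu^{(t)})^T (\nu - \nu^{(t)}),$$

where $\nu^{(t)}$ denotes the solution for $\nu$ in iteration $t$, and $\nabla$ denotes the subgradient operator. Similarly we can perform CCCP to relax the constraint in Eq. 8. Letting $h(\nu)$ denote the minimization problem in Eq. 7, 8, and 9, using CCCP we can further write down the subgradient $\nabla h(\nu^{(t+1)})$ in iteration $t + 1$ as follows:

$$\nabla h(\nu^{(t+1)}) = \lambda_1 \nu^{(t)} + \lambda_2 \sum_{i, y, s} \frac{\mathbb{1}_{\{\epsilon_{ys} > 0,\, y_i = y\}}}{N_y} \left[ \nabla g_1(x_i, s, \nu^{(t)}) + \nabla g_2(x_i, s, \nu^{(t)}) \right] + \lambda_3 \sum_{i, y} \mathbb{1}_{\{\xi_{iy} > 0\}} \left[ \nabla g_1(x_i, y, \nu^{(t)}) + \nabla g_2(x_i, y, \nu^{(t)}) \right]. \tag{11}$$

Then we use subgradient descent to update $\nu$, equivalently learning $\mathcal{V}$. With simple algebra, we can show that the $m$th entry for class $s$ in $\nabla g_1(\nu^{(t)}) + \nabla g_2(\nu^{(t)})$ is equivalent to the $m$th entry in

$$\left. \frac{\partial f(x_i, y)}{\partial v_s} \right|_{\nu^{(t)}} - \left. \frac{\partial f(x_i, y_i)}{\partial v_s} \right|_{\nu^{(t)}}.$$

In order to guarantee the monotonic decrease of the objective in Eq. 7, we add an extra checking step in each iteration.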
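
The partial derivatives of the scoring function with respect to the reference vectors, which drive the subgradient step, have a simple entry-wise form for both transforms. The sketch below writes them out and compares them with central finite differences; all function names and toy values are assumptions, and points where an entry of $x$ equals an entry of $v_s$ (the kinks) are avoided in the example.

```python
import numpy as np

def score(x, z_y, V, w, kind):
    """f(x, y) = sum_s <w, phi_s(x)> z_{y,s}; kind is "int" or "relu"."""
    Phi = np.minimum(x, V) if kind == "int" else np.maximum(0.0, x - V)
    return float((Phi @ w) @ z_y)

def score_grad_V(x, z_y, V, w, kind):
    """Subgradient of f(x, y) w.r.t. V: entry (s, m) = z_{y,s} * w_m * dphi."""
    if kind == "int":
        dphi = (V < x).astype(float)      # d min(x_m, v_m) / d v_m
    else:
        dphi = -(x > V).astype(float)     # d max(0, x_m - v_m) / d v_m
    return z_y[:, None] * w[None, :] * dphi

def numeric_grad_V(x, z_y, V, w, kind, h=1e-6):
    """Central finite differences, used only as a sanity check."""
    G = np.zeros_like(V)
    for idx in np.ndindex(*V.shape):
        Vp, Vm = V.copy(), V.copy()
        Vp[idx] += h
        Vm[idx] -= h
        G[idx] = (score(x, z_y, Vp, w, kind) - score(x, z_y, Vm, w, kind)) / (2 * h)
    return G

x = np.array([1.0, 3.0])
V = np.array([[2.0, 2.0],
              [0.0, 4.0]])
w = np.array([1.0, 2.0])
z_y = np.array([0.25, 0.75])
g_int = score_grad_V(x, z_y, V, w, "int")
g_relu = score_grad_V(x, z_y, V, w, "relu")
```

The difference of two such gradients, one for label $y$ and one for the ground-truth label $y_i$, gives the per-sample term that enters the subgradient of the objective.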


2.5. Cross Validation on Seen Class Data 


The scoring function in Eq. 6 is obtained by solving Eq. 3 and 7, which in turn depend on parameters $\theta = (\gamma, \lambda_1, \lambda_2, \lambda_3)$. We propose learning these parameters by means of cross validation using held-out seen class data. Specifically, define $\mathcal{S}_i \subset \mathcal{S}$ and the held-out set $\mathcal{S}_h = \mathcal{S} \setminus \mathcal{S}_i$. We learn a collection of embedding functions for source and target domains using Eq. 3 and 7 over a range of parameters $\theta$ suitably discretized in 4D space. For each parameter choice $\theta$ we obtain a scoring function, which depends on the training subset as well as the parameter choice. We then compute the prediction error, namely, the number of times that a held-out target domain sample is misclassified for this parameter choice. We repeat this procedure for different randomly selected subsets $\mathcal{S}_i$ and choose the parameters with the minimum average prediction error. Once these parameters are obtained, we plug them back into Eq. 3 and 7, and re-learn the scoring function using all the seen classes.
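
The hold-out procedure described above can be organized as a plain grid search. The skeleton below is a hedged sketch: `fit_and_error` is a hypothetical callback that would train the embeddings on the retained seen classes and return the error on the held-out ones, and the number of rounds, hold-out size, and toy error function are assumptions.

```python
import itertools
import random

def cross_validate(seen_classes, grid, fit_and_error,
                   n_rounds=3, n_holdout=2, seed=0):
    """Pick the parameter tuple with the lowest mean held-out error.

    grid:          one iterable of candidate values per parameter.
    fit_and_error: hypothetical callback (theta, train_classes, held_out) -> error.
    """
    rng = random.Random(seed)
    classes = sorted(seen_classes)
    splits = []
    for _ in range(n_rounds):
        held = set(rng.sample(classes, n_holdout))       # random held-out classes
        splits.append(([c for c in classes if c not in held], sorted(held)))
    best_theta, best_err = None, float("inf")
    for theta in itertools.product(*grid):               # every parameter combination
        err = sum(fit_and_error(theta, tr, ho) for tr, ho in splits) / len(splits)
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err

# Toy stand-in for training: the error depends only on the parameters.
toy_error = lambda theta, train, held: abs(theta[0] - 10) + abs(theta[1] - 1)
best_theta, best_err = cross_validate(range(8), [(0, 1, 10), (0.1, 1)], toy_error)
```

After the search, the selected parameters would be used to re-train on all seen classes, as the text describes.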













3. Experiments 

We test our method on five benchmark image datasets for zero-shot recognition, i.e. CIFAR-10 [15], aPascal & aYahoo (aP&Y) [8], Animals with Attributes (AwA) [15], Caltech-UCSD Birds-200-2011 (CUB-200-2011) [38], and SUN Attribute [26]. For all the datasets, we utilize MatConvNet [36] with the "imagenet-vgg-verydeep-19" pretrained model [32] to extract a 4096-dim CNN feature vector (i.e. the top layer hidden unit activations of the network) for each image (or bounding box). Very deep features work well since they lead to good class separation, which is required for our class dependent transform (see Fig. 5). Similar CNN features were used in previous work [2] for zero-shot learning. We denote the two variants of our general method as SSE-INT and SSE-ReLU, respectively. Note that in terms of experimental settings, the main difference between our method and the competitors is the features. We report the top-1 recognition accuracy averaged over 3 trials.

We set $\gamma, \lambda_2, \lambda_3 \in \{0, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}\}$ in Eq. 3 and 7 for cross validation. In each iteration, we randomly choose two seen classes for validation, and fix $\nu$ in Alg. 1 to its initialization to speed up computation. For
Ai, we simply set it to a small number 10“^ because it is 
much less important than the others for recognition. 

3.1. CIFAR-10 

This dataset consists of 60000 color images with a resolution of 32 × 32 pixels (50000 for training and 10000 for testing) from 10 classes. [34] enriched it with 25 binary attributes and 50-dim real-valued semantic word vectors for each class. We follow the settings in [34]. Precisely, we take cat-dog, plane-auto, auto-deer, deer-ship, and cat-truck as test categories for zero-shot recognition, respectively, and use the remaining 8 classes as seen class data. Our training and testing are performed on the training and test splits provided in the dataset, respectively.

We first summarize the accuracy of [34] and our method in Table 2. Clearly our method outperforms [34] significantly, and SSE-INT and SSE-ReLU perform similarly. We observe that for cat-dog our method performs similarly to [34], while for the others our method easily achieves very high accuracy. We show the class affinity matrix in Fig. 4(a) using the binary attribute vectors, and it turns out that cat and dog have a very high similarity. Similarly, the word vectors for cat and dog provide more discrimination than the attribute vectors but still much less than for the other pairs.

To better understand our SSE learning method, we visualize the target domain CNN features as well as the learned SSE features using t-SNE [35] in Fig. 4(b-d). Due to different seen classes, the learned functions and embeddings for Fig. 4(c) and Fig. 4(d) are different. In Fig. 4(b), CNN features seem to form clusters for different classes with some overlaps, and there is a small gap between "animals" and "artifacts". In contrast, our SSE features are guided by source domain attribute vectors, and indeed preserve the affinities between classes in the attribute space. In other words, our learning algorithm manages to align the target domain distributions with their corresponding source domain embeddings in SSE space, as well as discriminating each target domain instance from wrong classes. As we see, the gaps between animals and artifacts are much clearer in Fig. 4(c) and Fig. 4(d) than in Fig. 4(b). For cat and dog, however, there is still a large overlap in SSE space, leading to poor recognition. The overall sample distributions in Fig. 4(c) and Fig. 4(d) are similar, because they both preserve the same class affinities.

(c) SSE embeddings (auto-deer) (d) SSE embeddings (cat-dog)

Figure 4. (a) Class affinities for the 10 classes using source domain binary attribute vectors. (b-d) t-SNE visualization of different features with 25 attributes, where 100 samples per class in the test set are selected randomly and the same color denotes the same class. (b) shows the 4096-dim original target domain CNN features; (c) and (d) show the 8-dim SSE features learned by SSE-INT and tested on auto-deer and cat-dog, respectively. The embeddings produced by SSE-ReLU have similar patterns.

3.2. Other Benchmark Comparison 

For the details of each dataset, please refer to its original paper. For the aP&Y, CUB-200-2011, and SUN Attribute datasets, we take the means of attribute vectors from the same classes to generate source domain data. For the AwA dataset, we utilize the real-number attribute vectors since they are more discriminative.

We utilize the same training/testing splits for zero-shot recognition on aP&Y and AwA as others. For CUB-200-2011, we follow [1] and use the same 150 bird species as seen classes for training and the remaining 50 species as unseen classes for testing. For SUN Attribute, we follow [14] to use the








Table 2. Zero-shot recognition accuracy comparison (%, mean±standard deviation) on CIFAR-10. The compared numbers are best estimated from Fig. 3 in [34]. Notice that all the methods here utilize deep features to represent images in the target domain.

| Method | cat-dog | plane-auto | auto-deer | deer-ship | cat-truck | Average |
|---|---|---|---|---|---|---|
| Socher et al. [34] (50 words) | 50 | 65 | 76 | 83 | 90 | 72.8 |
| SSE-INT (50 words) | 59.00±0.57 | 91.62±0.19 | 97.95±0.13 | 95.73±0.08 | 97.20±0.05 | 88.30 |
| SSE-ReLU (50 words) | 58.78±1.60 | 91.33±0.53 | 97.33±0.28 | 95.37±0.29 | 97.32±0.12 | 88.03 |
| SSE-INT (25-dim binary vectors) | 48.47±0.08 | 93.93±0.59 | 99.07±0.18 | 96.03±0.03 | 96.92±0.14 | 86.88 |
| SSE-ReLU (25-dim binary vectors) | 48.52±0.13 | 93.68±0.73 | 98.48±0.15 | 95.32±0.25 | 96.43±0.06 | 86.49 |

Table 3. Zero-shot recognition accuracy comparison (%) on aP&Y, AwA, CUB-200-2011, and SUN Attribute, respectively, in the form of mean±standard deviation. Except for our results, the numbers are cited from their original papers. Note that some experimental settings may differ from ours.

| Feature | Method | aPascal & aYahoo | Animals with Attributes | CUB-200-2011 | SUN Attribute |
|---|---|---|---|---|---|
| Non-CNN | Farhadi et al. [8] | 32.5 | | | |
| Non-CNN | Mahajan et al. [18] | 37.93 | | | |
| Non-CNN | Wang and Ji [39] | 45.05 | 42.78 | | |
| Non-CNN | Rohrbach et al. [28] | | 42.7 | | |
| Non-CNN | Yu et al. [42] | | 48.30 | | |
| Non-CNN | Akata et al. [1] | | 43.5 | 18.0 | |
| Non-CNN | Fu et al. [10] | | 47.1 | | |
| Non-CNN | Mensink et al. [20] | | | 14.4 | |
| Non-CNN | Lampert et al. [16] | 19.1 | 40.5 | | 52.50 |
| Non-CNN | Jayaraman and Grauman [14] | 26.02±0.05 | 43.01±0.07 | | 56.18±0.27 |
| Non-CNN | Romera-Paredes and Torr [30] | 27.27±1.62 | 49.30±0.21 | | 65.75±0.51 |
| AlexNet | Akata et al. [2]* | | 61.9 | 40.3 | |
| vgg-verydeep-19 | Lampert et al. [16] | 38.16 | 57.23 | | 72.00 |
| vgg-verydeep-19 | Romera-Paredes and Torr [30] | 24.22±2.89 | 75.32±2.28 | | 82.10±0.32 |
| vgg-verydeep-19 | SSE-INT | 44.15±0.34 | 71.52±0.79 | 30.19±0.59 | 82.17±0.76 |
| vgg-verydeep-19 | SSE-ReLU | 46.23±0.53 | 76.33±0.83 | 30.41±0.20 | 82.50±1.32 |

* The results listed here are the ones with 4096-dim CNN features and the continuous attribute vectors provided in the datasets for fair comparison.





(a) decaf (b) verydeep-19
Figure 5. t-SNE visualization comparison of SSE distributions using the two CNN features on AwA testing data. Our method works well if there is good separation for classes and verydeep features are particularly useful.


same 10 classes as unseen classes for testing (see their supplementary file) and take the rest as seen classes for training.

We summarize our comparison in Table 3, where blank entries indicate that a method was not tested on that dataset in its original paper. Again, there is no large performance difference between SSE-INT and SSE-ReLU. Our method works best on 4 out of the 5 datasets, the exception being CUB-200-2011. On one hand, [2] specifically targets fine-grained zero-shot recognition, as required by this dataset, while ours aims at general zero-shot learning. On the other hand, we suspect that the source domain projection function may not work well in fine-grained recognition, and we will investigate this further in future work.

To understand our method better with different features, we test 7 features on the AwA dataset‡. We show the SSE distribution comparison using decaf CNN features and vgg-verydeep-19 CNN features in Fig. 5. There is a large difference between the two distributions: (a) with decaf features the clusters are slightly separated but still cluttered, with overlaps among different classes; (b) vgg-verydeep-19 features, in contrast, form crisp clusters for different classes, which is useful for zero-shot recognition. We also plot the cosine similarity matrices created using different features in Fig. 6. As we see, the matrix from vgg-verydeep-19 features (i.e. the last) is the most similar to that from the source domain attribute vectors (i.e. the first). This indicates that our learning method with vgg-verydeep-19 features can align the target domain distribution with the source domain attribute vectors. We attribute this to the fact that we need a class dependent feature transform φ_y(x) that has good separation on seen classes.

Our implementation§ is based on unoptimized MATLAB code. Nevertheless, it returns the prediction results on any of these 5 datasets within 30 minutes using a multi-thread CPU (Xeon E5-2696 v2), starting from loading CNN features. For instance, on CIFAR-10 the code finishes running in less than 5 minutes.

‡We downloaded these features from http://attributes.kyb.tuebingen.mpg.de/

§Our code is available at https://zimingzhang.wordpress.com/source-code/.






















(a) attributes (b) cq-hist (31.5) (c) lss-hist (30.3) (d) rgsift-hist (33.6) (e) sift-hist (29.8) (f) surf-hist (36.5) (g) decaf (52.0) (h) verydeep-19 (71.5)

Figure 6. Cosine similarity matrices created using different features on AwA testing data. The numbers in the brackets are the mean accuracy (%) achieved using the corresponding features. Our learning method performs the best with vgg-verydeep-19 features. We can attribute this to the fact that we need a class dependent feature transform φ_y(x) that has good separation on seen classes.
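To make the comparison behind Fig. 6 concrete, the following minimal sketch builds a cosine similarity matrix from one vector per class and scores how closely a feature-based matrix matches the attribute-based one. The toy vectors are made up, and the mean absolute difference is our own illustrative choice of score; the comparison in the figure is visual, and the exact vectors used to build those matrices are not restated here.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return S with S[i][j] = cosine of the angle between vectors[i] and vectors[j]."""
    norms = [math.sqrt(sum(v * v for v in vec)) for vec in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def mean_abs_difference(a, b):
    """Average element-wise gap between two square matrices of the same size."""
    n = len(a)
    total = sum(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


# Hypothetical data: one row per class.
attributes = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
features_a = [[0.9, 0.1, 1.1], [1.0, 0.9, 0.1], [0.1, 1.0, 0.9]]  # keeps class affinities
features_b = [[1.0, 1.0, 1.0], [1.0, 1.0, 0.9], [0.9, 1.0, 1.0]]  # classes nearly collapsed

reference = cosine_similarity_matrix(attributes)
gap_a = mean_abs_difference(reference, cosine_similarity_matrix(features_a))
gap_b = mean_abs_difference(reference, cosine_similarity_matrix(features_b))
print(f"gap_a={gap_a:.3f}, gap_b={gap_b:.3f}")
```

A smaller gap means the feature space preserves the class affinities encoded by the attribute vectors more faithfully, which is the property the text credits to the vgg-verydeep-19 features.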



(a) Recognition on unseen classes (b) Recognition on all classes 
Figure 7. Large-scale zero-shot recognition on SUN Attribute. 


3.3. Towards Large-Scale Zero-Shot Recognition 

We test the generalization ability of our method on the SUN Attribute dataset for large-scale zero-shot recognition. We design two experimental settings: (1) As in the benchmark comparison, we randomly select M classes as seen classes for training, and then among the remaining 717 - M classes we randomly select 10, 20, ..., 717 - M classes as unseen classes for testing; (2) we randomly select 10, 20, ..., 700 classes as seen classes for training, and categorize each data sample from the remaining unseen classes into one of the 717 classes. Fig. 7 shows our results, where (a) and (b) correspond to settings (1) and (2), respectively.
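The two settings differ mainly in the label set a test sample may be assigned to. A minimal sketch of the class splits is given below; the function names, the returned dictionary layout, and the fixed seed are our own illustrative assumptions, not part of the original MATLAB implementation.

```python
import random

NUM_CLASSES = 717  # number of SUN Attribute classes quoted in the text


def split_setting_one(num_seen, num_unseen, rng):
    """Setting (1): test samples are labeled among the sampled unseen classes only."""
    if num_seen + num_unseen > NUM_CLASSES:
        raise ValueError("seen + unseen classes exceed the 717 available classes")
    shuffled = rng.sample(range(NUM_CLASSES), NUM_CLASSES)
    seen = shuffled[:num_seen]
    unseen = shuffled[num_seen:num_seen + num_unseen]
    return {"seen": seen, "test": unseen, "candidates": unseen}


def split_setting_two(num_seen, rng):
    """Setting (2): test samples from all remaining classes are labeled among all 717 classes."""
    shuffled = rng.sample(range(NUM_CLASSES), NUM_CLASSES)
    seen = shuffled[:num_seen]
    unseen = shuffled[num_seen:]
    return {"seen": seen, "test": unseen, "candidates": list(range(NUM_CLASSES))}


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
s1 = split_setting_one(num_seen=17, num_unseen=10, rng=rng)
s2 = split_setting_two(num_seen=10, rng=rng)
print(len(s1["candidates"]), len(s2["candidates"]))
```

In setting (1) the candidate label set shrinks or grows with the number of unseen classes, whereas in setting (2) it is always the full set of 717 classes, which makes the second task considerably harder.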

In Fig. 7(a), we can see that with very few seen classes we achieve reasonably good performance when there are only a few unseen classes. However, as the number of unseen classes increases, the curve drops rapidly and then changes slowly once the number is large. From 200 to 700 unseen classes, our performance is reduced from 8.62% to 2.85%. As the number of seen classes increases, our performance improves, especially when the number of unseen classes is small. With 10 unseen classes, our performance increases from 61.00% to 87.17% using 17 and 317 seen classes, respectively. But such improvement is marginal when there is already a sufficient number of seen classes, for instance from 217 to 317 seen classes.

In Fig. 7(b), generally speaking, our performance improves with more seen classes, because there is a better chance of preserving the semantic affinities among classes in the source domain. With only 10 seen classes, our method achieves 1.59% mean accuracy, which is much better than the random chance of 0.14%. Notice that even if we use all 717 classes as seen classes, we cannot guarantee that the testing results are similar to those of traditional classification methods, because the source domain attribute vectors guide the learning of our method. If they are less discriminative, e.g. the attribute vectors for cat and dog in CIFAR-10, the recognition performance may be worse.
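The random-chance figure follows directly from the number of candidate classes. The short sketch below reproduces it and expresses two of the reported accuracies as multiples of chance; the helper name `chance_accuracy` is ours, and the ratios are simple arithmetic on numbers quoted in the text rather than results reported by the authors.

```python
def chance_accuracy(num_candidate_classes):
    """Accuracy (%) of uniform random guessing among the candidate classes."""
    return 100.0 / num_candidate_classes


# Setting (2): every test sample is assigned to one of 717 classes.
chance_all = chance_accuracy(717)
lift_b = 1.59 / chance_all  # 1.59% mean accuracy with only 10 seen classes

# Setting (1): 700 unseen classes, 2.85% accuracy read from the text on Fig. 7(a).
chance_700 = chance_accuracy(700)
lift_a = 2.85 / chance_700

print(f"chance over 717 classes: {chance_all:.2f}%")
print(f"lift in setting (2): {lift_b:.1f}x, lift in setting (1): {lift_a:.1f}x")
```

Under these numbers the method stays roughly an order of magnitude above chance even in the hardest configurations, although the absolute accuracies remain low.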

To summarize, our method performs well and stably on SUN Attribute with a small set of seen classes and a relatively large set of unseen classes. Therefore, we believe that our method is suitable for large-scale zero-shot recognition.

4. Conclusion 

We proposed learning a semantic similarity embedding (SSE) method for zero-shot recognition. We label the semantic meanings using seen classes, and project all the source domain attribute vectors onto the simplex in SSE space, so that each class can be represented as a probabilistic mixture of seen classes. Then we learn similarity functions to embed target domain data into the same semantic space as the source domain, so that not only are the empirical mean embeddings of the seen class data distributions aligned with their corresponding source domain embeddings, but each data instance can also be classified correctly. We propose learning two variants, using the intersection function and the rectified linear unit (ReLU). On five benchmark datasets, including the large-scale SUN Attribute dataset, our method significantly outperforms other state-of-the-art methods. As future work, we would like to explore other applications for our method, such as person re-identification [44, 45, 46] and zero-shot activity retrieval [6].
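As a compact illustration of the summary above, the sketch below shows element-wise forms of the two variants and a nearest-embedding prediction rule. It rests on several assumptions that are not restated in this section: that the intersection variant is min(x, v) and the ReLU variant is max(0, x - v) for a class-specific vector v, and that prediction picks the unseen class whose source-domain mixture has the largest inner product with the target-domain embedding. All names and numbers are hypothetical.

```python
def intersection_transform(x, v):
    """Assumed intersection variant: element-wise min(x, v)."""
    return [min(a, b) for a, b in zip(x, v)]


def relu_transform(x, v):
    """Assumed ReLU variant: element-wise max(0, x - v)."""
    return [max(0.0, a - b) for a, b in zip(x, v)]


def predict(target_embedding, unseen_class_embeddings):
    """Return the unseen class whose embedding has the largest inner product with the target."""
    def score(name):
        return sum(a * b for a, b in zip(target_embedding, unseen_class_embeddings[name]))
    return max(unseen_class_embeddings, key=score)


# Hypothetical mixtures over three seen classes (each sums to 1, i.e. lies on the simplex).
unseen = {
    "class_u1": [0.7, 0.2, 0.1],
    "class_u2": [0.1, 0.1, 0.8],
}
target = [0.6, 0.3, 0.1]  # made-up embedding of one test instance

x, v = [0.2, 0.9], [0.5, 0.5]
int_out = intersection_transform(x, v)
relu_out = relu_transform(x, v)
label = predict(target, unseen)
print(int_out, relu_out, label)
```

The point of the sketch is only that both domains end up in a space indexed by seen classes, so an unseen class can be recognized by comparing mixture patterns.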